# Lab 21 - Simple linear regression

We will use the dataset with information about the labor market for recent college graduates from Lab 20.  You can download the CSV file [here](http://comet.lehman.cuny.edu/owen/teaching/mat128/Feb2019_labor_market_majors.csv).

We need to install another library called `statsmodels`:

In [None]:
!pip install --user statsmodels

Import Seaborn and the other libraries so we can use them in our code.

In [15]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import statsmodels.formula.api as smf
%matplotlib inline

### Loading and cleaning the data

As in Lab 20 we need to read our CSV file into a dataframe and clean it.

Load the data in the dataframe `labor`, remembering to skip the non-data rows at the start and end of the file.
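
As a reminder of the pattern, here is a minimal sketch that skips non-data rows with `pd.read_csv`. The file contents and the row counts (2 at the top, 1 at the bottom) are made up for illustration; check the actual CSV file to find the right numbers for `labor`.

```python
import io
import pandas as pd

# Hypothetical file contents: 2 note lines at the top, 1 footnote at the bottom.
raw = "\n".join([
    "Source: made-up example",
    "Note: values are invented",
    "Major,Median Wage Early Career",
    'Art,"35,000"',
    'Math,"50,000"',
    "Footnote: not real data",
])

# skiprows drops lines from the start; skipfooter drops lines from the end
# and needs the slower Python parsing engine.
demo = pd.read_csv(io.StringIO(raw), skiprows=2, skipfooter=1, engine="python")
print(demo)
```

For the real file, replace `io.StringIO(raw)` with the file name or URL.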

Display your `labor` dataframe below to check it was created properly.

To make things easier, we will rename the columns to shorter names without spaces.

In [10]:
labor.columns = ["major","unemployment","underemployment","early","mid","graduate"]

Next we need to remove the commas from the `early` and `mid` columns, and change the column type to float, as in Lab 20.

<details> <summary>Answer:</summary>
<code>labor["early"] = labor["early"].str.replace(",","").astype(float)
labor["mid"] = labor["mid"].str.replace(",","").astype(float)</code>
</details>
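
To see what the cleaning step does, here is a small sketch on a made-up Series of wage strings (the values are invented, not taken from the dataset):

```python
import pandas as pd

# Invented wage strings with thousands separators, as they appear in a CSV.
wages = pd.Series(["35,000", "50,000", "1,250,000"])

# Remove every comma, then convert the remaining digits to floats.
cleaned = wages.str.replace(",", "").astype(float)
print(cleaned.tolist())
```

Without this step the column holds strings, so numeric operations such as correlation or regression would not work on it.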

Check that this code worked correctly by displaying `labor` again:

### Simple linear regression

Recall from Lab 20 that the columns `early` (which was `Median Wage Early Career`) and `mid` (which was `Median Wage Mid-Career`) were the most correlated, with a correlation of 0.848. 

Let's remind ourselves what the relationship looked like by plotting a scatter plot with `early` on the x axis and `mid` on the y axis.

<details> <summary>Hint:</summary>
Scatter plot code pattern:
<code>df.plot.scatter(x = "column name 1", y = "column name 2")</code>
</details>

We can perform linear regression with the following code, which does not display anything:

In [16]:
lm = smf.ols(formula = 'mid ~ early', data = labor).fit()

The formula always has the form `dependent_variable ~ independent_variable`. The variable names are the column names. A column name that contains spaces cannot be written directly in the formula; it must be wrapped as `Q('variable name with spaces')`.
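
The following sketch shows the `Q(...)` form on a small made-up dataframe whose column names contain spaces. The column names `early wage` and `mid wage` and all the values are hypothetical, chosen only to illustrate the syntax.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Invented data with spaces in the column names.
demo = pd.DataFrame({
    "early wage": [30.0, 40.0, 50.0, 60.0],
    "mid wage": [61.0, 79.0, 102.0, 118.0],
})

# Q('...') quotes a column name that is not a valid Python identifier.
fit = smf.ols(formula="Q('mid wage') ~ Q('early wage')", data=demo).fit()
print(fit.params)
```

Renaming the columns, as we did for `labor`, avoids the need for `Q(...)` and keeps the formulas shorter.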

All the information about the linear model is stored in the variable `lm`.  To see this information, type and run  `lm.summary()` below. 

This is a lot of information!  We will only focus on a few of the numbers.  The coefficients for the regression line can be found in the middle section, under the column `coef`: the intercept is $\beta_0 = 1.014 \times 10^4 = 10140$ and the slope is $\beta_1 = 1.37632$.  Thus, the regression line equation is $y = 1.37632x + 10140$, where $x$ is the `early` value and $y$ is the predicted `mid` value.
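
As a worked example of using the regression line, the sketch below plugs an early-career wage into the equation. The input of 40000 is a hypothetical value, and the intercept is the rounded figure from the summary, so the result is approximate.

```python
beta_0 = 10140.0   # intercept, as rounded in the summary table
beta_1 = 1.37632   # slope

def predict_mid(early):
    """Predicted mid-career median wage for a given early-career median wage."""
    return beta_1 * early + beta_0

example_early = 40000.0  # hypothetical early-career wage
print(predict_mid(example_early))
```

The slope says that each additional dollar of early-career median wage corresponds to about 1.38 additional dollars of predicted mid-career median wage.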

You can also get the coefficients by typing `lm.params`:

We will use Seaborn to plot the regression line on the scatterplot in one step.  Type and run the code: `sns.regplot(x = 'early',y = 'mid', data = labor)`

To further check if the linear model is a good fit for the data, we can plot the residuals to see if they have a normal distribution.  To get the residuals, type `lm.resid` below.
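
A residual is the observed value minus the value predicted by the regression line. The sketch below computes residuals by hand on invented data, using `np.polyfit` as a stand-in for the fitted model (an assumption made here for illustration; in this lab the residuals come from `lm.resid`).

```python
import numpy as np

# Invented data, roughly following a straight line.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares line; polyfit returns the slope first, then the intercept.
slope, intercept = np.polyfit(x, y, 1)

fitted = intercept + slope * x   # predicted y for each x
residuals = y - fitted           # observed minus predicted
print(residuals)
```

For a least-squares line with an intercept, the residuals always sum to zero (up to rounding), so what matters is the shape of their distribution.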

Can you figure out how to plot the residuals as a histogram?  They are already stored as a Pandas Series.

Does the distribution of the residuals look normal?  Can you find the residual outlier on the scatter plot?  Can you figure out what major that is?  We looked at how to identify outliers in Lab 5. 

<details> <summary>Answer:</summary>
The outlier has the greatest mid-career median wage, so we can look for the row with the max value in this column:<br>
<code>max_row = labor["mid"].idxmax()</code><br>
and then display that row:<br>
<code>labor.loc[max_row]</code>
</details>
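
Here is a small sketch of the same lookup on a made-up dataframe. The majors, wages, and index labels are invented; the index deliberately does not start at 0 to show that `idxmax` returns an index label rather than a row position.

```python
import pandas as pd

# Invented data with index labels 10, 11, 12 instead of 0, 1, 2.
demo = pd.DataFrame(
    {"major": ["Art", "Math", "Nursing"], "mid": [52000.0, 80000.0, 71000.0]},
    index=[10, 11, 12],
)

max_row = demo["mid"].idxmax()   # index label of the largest value
print(demo.loc[max_row])         # .loc looks rows up by label
```

With the default index 0, 1, 2, ... labels and positions coincide, so label-based and position-based lookups return the same row.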

### Regression line: underemployment vs. median wage early career

The next most correlated pair of columns was `underemployment` and `early`, with a correlation of -0.564108. Compute the linear regression model with `underemployment` as the independent variable and `early` as the dependent variable.

Display a summary of your linear regression model.

What is the equation for the linear regression line?

Use the Seaborn package to plot the scatter plot of underemployment vs. median wage early career with the regression line.

Plot the residuals as a histogram.

Is the distribution of the residuals roughly normal?  Is your linear regression model a good representation of the relationship between underemployment and median wage early career?

### Challenges
- In the green taxi trip data, what is the correlation between trip distance and fare amount?  Compute the linear regression line with trip distance as the independent variable and fare amount as the dependent variable.  Find the equation for the regression line, plot the scatterplot with the line, and check if the residuals are normally distributed.